Abstract

Reducing the size and complexity of a neural network is essential to improve
generalization, reduce training error, and improve network speed. Most of the
known optimization methods rely heavily on weight sharing concepts for pattern
separation and recognition. The method presented here focuses on network
topology and information content for optimization. We have studied the change
in network topology and its effect on information content dynamically during
the optimization of the network. The changes in network topology were achieved
by altering the number of nonzero weights. The primary optimization method is
scaled conjugate gradient and the secondary method is a Boltzmann method. The
conjugate gradient optimization serves as a connection creation operator and
the Boltzmann method serves as a competitive connection annihilation operator.
By combining these two methods it is possible to generate, from small training
sets, small networks that have similar testing and training accuracy and good
generalization. Our findings demonstrate that for a difficult character
recognition problem the number of weights in a fully connected network can be
reduced by over 90%.


1 Introduction 

The size and complexity of neural network applications have grown rapidly.
The search for small networks with large information content and
generalization capability is ongoing. Most optimization strategies are a
trade-off between error and network complexity. The known optimization schemes
[1, 2, 3] have used this trade-off to minimize the cost function. Among the
various complexity measures, the Vapnik-Chervonenkis (VC) dimension [4]
concentrates on information content and the distribution of information in the
network. The error term associated with increasing VC dimension can be reduced
by greatly expanding the size of the training set or by reducing the VC
dimension of the network.

Boltzmann methods have been used as a statistical method for combinatorial
optimization and for the design of learning algorithms [5, 6]. This method can
be used in conjunction with a supervised learning method to dynamically reduce
network size. The strategy used in this research is to remove the weights using
Boltzmann criteria during the training process. Information content is used as
a measure of network complexity for evaluation of the resulting network.

The competing mechanisms involved when the Boltzmann method is used in
conjunction with the scaled conjugate gradient (SCG) method are shown in
table 1. This table lists five points on which the two methods can be
compared. The Boltzmann method is self-organizing while the SCG method is a
supervised learning method. The Boltzmann method seeks to minimize the number
of weights while maintaining the information content of the network. The SCG
method seeks to minimize an error function on the training set. The
controlling informational parameter for the Boltzmann method is the
information in the network at iteration time $t$ as $t \to \infty$. The
controlling informational parameter for the SCG method is the information
provided at $t = 0$ in the initial weights. The algorithmic control in the
Boltzmann method is the temperature sequence applied during the iteration. The
equivalent controlling parameter for the SCG method is the restart sequence.


Boltzmann Method             SCG Method
---------------------------  ---------------------
Self-organization            Supervised learning
Information minimization     Error optimization
Generalization in testing    Error in training
Info ($t \to \infty$)        Info ($t = 0$)
Temperature sequence         Restart sequence

Table 1: Competing mechanisms when Boltzmann and SCG methods are combined for
concurrent network optimization.

2 Pruning via Boltzmann Methods

In this paper a fully connected network is optimized using the scaled
conjugate gradient (SCG) method developed by Møller [7] and modified by Blue
and Grother [8]. The network trained by SCG is used as the starting network
for the Boltzmann weight pruning algorithm. The network has an input layer
with thirty-two nodes, a variable-size hidden layer with sixteen, thirty-two
or sixty-four nodes, and an output layer with ten nodes. The initial network
is fully connected. Pruning was carried out by selecting a normalized
temperature, $T$, and removing weights based on a probability of removal:

$$P_i = \exp(-|w_i| / T)$$

The values of $P_i$ are compared to a set of uniformly distributed random
numbers, $R_i$, on the interval [0, 1]. If the probability $P_i$ is greater
than $R_i$ then the corresponding weight is set to zero. The process is
carried out at each iteration of the SCG optimization process and is dynamic.
If a weight is removed it may subsequently be restored by the SCG algorithm;
the restored weight may survive if it has sufficient magnitude in subsequent
iterations.
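
The removal rule and its interaction with training can be sketched in a few
lines of Python. This is a minimal illustration under stated assumptions, not
the implementation used in this work: the weights are held in a flat list, the
model is a toy linear one, and a plain gradient step stands in for an SCG
iteration. The names `removal_probability`, `boltzmann_prune`, `gradient_step`
and `train_with_pruning` are hypothetical.

```python
import math
import random


def removal_probability(weight, temperature):
    """P_i = exp(-|w_i| / T); weights near zero have P_i close to 1."""
    return math.exp(-abs(weight) / temperature)


def boltzmann_prune(weights, temperature, rng):
    """Zero every weight whose removal probability exceeds a uniform draw."""
    return [0.0 if removal_probability(w, temperature) > rng.random() else w
            for w in weights]


def gradient_step(weights, samples, learning_rate):
    """Stand-in for one SCG iteration (assumption): a plain gradient step on
    mean squared error for a linear model y = sum_i w_i * x_i."""
    grads = [0.0] * len(weights)
    for x, y in samples:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grads[i] += 2.0 * err * xi / len(samples)
    return [w - learning_rate * g for w, g in zip(weights, grads)]


def train_with_pruning(weights, samples, temperature, iterations, rng,
                       learning_rate=0.1):
    """Alternate a training step with a pruning step.

    Returns the final weights and the nonzero-weight count per iteration.
    """
    nonzero_counts = []
    for _ in range(iterations):
        # Training may move a zeroed weight away from zero (restoration).
        weights = gradient_step(weights, samples, learning_rate)
        # Pruning may remove it again if its magnitude is still small.
        weights = boltzmann_prune(weights, temperature, rng)
        nonzero_counts.append(sum(1 for w in weights if w != 0.0))
    return weights, nonzero_counts


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Toy data (assumption): the target depends only on the first of 4 inputs.
    samples = []
    for _ in range(50):
        x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
        samples.append((x, 2.0 * x[0]))
    start = [rng.uniform(-0.5, 0.5) for _ in range(4)]
    final, counts = train_with_pruning(start, samples, 0.1, 200, rng)
    print(final)
    print(counts[-10:])
```

A weight that is exactly zero has a removal probability of 1, so in this
sketch it stays at zero until a training step moves it, which mirrors the
removal and restoration cycle described above.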

The dynamic effect of this is shown in figure 1 for five temperatures between
0.1 and 0.5 at 0.1 intervals, starting from a fully converged and fully
connected network. As the temperature increases, the number of weights removed
initially increases, but the effect of later iterations of optimization and
pruning is to decrease the rate at which weights are removed. The number of
weights in the initial network was 1386, including bias weights. At all
temperatures the initial iterations are very effective in reducing the number
of weights. The decrease in the rate of pruning is the result of a critical
phenomenon characterized by a critical temperature, $T_c$, at which the new
information added by the SCG training balances the information removed by
pruning. At this critical point networks trained on small training sets will
achieve identical testing and training accuracy even when tested on large test
sets.
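
The figure of 1386 weights can be checked with a short calculation. The sketch
below assumes one bias weight for every hidden and output node, which
reproduces the quoted count for the 32-32-10 network; `count_weights` is a
hypothetical helper name.

```python
def count_weights(layer_sizes):
    """Count the weights of a fully connected feed-forward network.

    Assumes one bias weight per node in every layer after the input layer.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


if __name__ == "__main__":
    for hidden in (16, 32, 64):
        # Prints 698, 1386 and 2762 for the three hidden layer sizes.
        print(hidden, count_weights([32, hidden, 10]))
```

Under the same assumption the 32-16-10 and 32-64-10 networks start with 698
and 2762 weights respectively.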

The effect of the number of hidden nodes can be seen in figures 2, 3 and 4.
Figure 2 shows the effect on the network with 32 hidden nodes used in figure
1. As the temperature is increased the recognition accuracy of the network
decreases slowly for temperatures up to 0.4. As the temperature approaches
0.5 the rate of weight removal shown in figure 1 slows and the rate of
accuracy decay accelerates. The two curves plotted are the training set and
testing set accuracy of the network. The training set accuracy is initially
greater than the testing accuracy. At a critical temperature, $T_c$, the
testing accuracy and training accuracy are identical. In figure 2 the critical
temperature, read from the plot, is 0.58; chaotic behavior sets in near
$T_c$ due to the critical effects of weight removal. The behavior of the
32-64-10 network in figure 4 is similar to that of the 32-32-10 network. The
32-16-10 network in figure 3 shows an increase in the critical temperature,
$T_c$, and a decrease in accuracy at $T_c$. This increase in $T_c$ is caused
by the reduced set of possible pruned configurations in the 32-16-10 network;
the initial 32-16-10 network is too small.

Figure 1: Weights removed as a function of iteration and temperature for
$T$ = 0.1, 0.2, 0.3, 0.4, 0.5. The lower curve is for $T = 0.1$; the upper
curve is for $T = 0.5$.

Figure 2: Change in testing and training accuracy as a function of temperature
for a 32-32-10 network after 1000 iterations at each temperature.

Figure 3: Change in testing and training accuracy as a function of temperature
for a 32-16-10 network after 1000 iterations at each temperature.

Figure 4: Change in testing and training accuracy as a function of temperature
for a 32-64-10 network after 1000 iterations at each temperature.

Figure 5: Weight distribution below $T_c$ at $T = 0.55$.

Figure 6: Weight distribution above $T_c$ at $T = 0.6$.

Figure 7: Information in weights, $\sum_i \log_2(|w_i|)$, below $T_c$ at
$T = 0.55$.

Figure 8: Information in weights, $\sum_i \log_2(|w_i|)$, above $T_c$ at
$T = 0.6$.

3 Weight Reduction and Information Content

The effect on the information content of the network can be evaluated by
examining the distribution of weights in the network as a function of
temperature. Figure 5 shows the distribution of the absolute value of the
weights at a temperature near, but below, $T_c$. Figure 6 shows the
distribution of the absolute value of the weights at a temperature near, but
above, $T_c$.

These distributions illustrate the mechanism involved in the collapse of
testing and training accuracy near $T_c$. The accuracy collapse is caused by
the large increase in weights near zero created by the most recent SCG
iteration. In a given training cycle some weights are removed. If these
weights are redundant they will be compensated for by other weights in the
network. If these weights are critical they will be restored by the SCG
optimization. The peak in the distribution near zero in both figures 5 and 6
is caused by this process. At $T_c$ the SCG creation process is just balanced
by the Boltzmann pruning.

The effect of the near-zero weights is more important when viewed as
information content. The VC dimension and the information content are both
approximately $\sum_i (\log_2(|w_i|) + 1)$. Weight distributions of this kind
are shown in figures 7 and 8 for $T$ below and above $T_c$, respectively. When
large numbers of near-zero weights exist, their contribution to the sum
dominates the network information. Under these conditions the network is
dominated by recently created weights which have not been optimized by SCG
iterations. This lowers network accuracy without reducing VC dimension.
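
The way near-zero weights dominate the sum can be seen in a small numerical
sketch. It assumes the information measure is taken over the nonzero weights
only, since the logarithm is undefined for a weight that has been pruned to
exactly zero; that convention and the names `information_terms` and
`weight_information` are assumptions made for illustration.

```python
import math


def information_terms(weights):
    """Per-weight terms log2(|w_i|) + 1 for every nonzero weight."""
    return [math.log2(abs(w)) + 1.0 for w in weights if w != 0.0]


def weight_information(weights):
    """Sum of log2(|w_i|) + 1 over the nonzero weights.

    Pruned (exactly zero) weights are skipped; this is an assumption.
    """
    return sum(information_terms(weights))


if __name__ == "__main__":
    # Two ordinary weights, one pruned weight, one recently created
    # near-zero weight.
    weights = [2.0, 0.5, 0.0, 2.0 ** -20]
    print(information_terms(weights))
    print(weight_information(weights))
```

In this example the single near-zero weight contributes a term of magnitude
19, against 2 and 0 for the two ordinary weights, so it dominates the total.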

To evaluate the generalization capability of the pruned network, the network
associated with a temperature $T = 0.55$ was tested on a sample of 221,000
digits [9]. The accuracy predicted from the $T_c$ data was 75.5%; the accuracy
achieved in the test was 72.6%. In this region the change in accuracy of the
network is about 5% for each $\Delta T$ of 0.001, so this agreement is
consistent with an accuracy in $T_c$ of ±0.0005 at a value of $T_c = 0.582$.
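
As a rough check on this estimate, and assuming the quoted sensitivity of 5%
per 0.001 is approximately linear over the interval, the gap between predicted
and achieved accuracy corresponds to a temperature shift of

$$\Delta T \approx \frac{|75.5\% - 72.6\%|}{5\% / 0.001}
  = \frac{2.9}{5} \times 0.001 \approx 0.0006,$$

which is comparable to the stated ±0.0005.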

4 Conclusions 

A method of network optimization has been developed which reduces by 80% to
90% the number of weights required for moderately accurate character
recognition. The method is based on achieving equilibrium between the
information in the training set and the number of network weights by
concurrent weight creation by SCG optimization and Boltzmann weight removal.
These reductions allow both smaller training sets and smaller classification
networks to be used.